{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# First look at our dataset\n",
    "\n",
    "In this notebook, we look at the necessary steps required before any machine\n",
    " learning takes place. It involves:\n",
    "\n",
    "* loading the data;\n",
    "* looking at the variables in the dataset, in particular, differentiate\n",
    "  between numerical and categorical variables, which need different\n",
    "  preprocessing in most machine learning workflows;\n",
    "* visualizing the distribution of the variables to gain some insights into the\n",
    "  dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the adult census dataset\n",
    "\n",
    "We use data from the 1994 US census that we downloaded from\n",
    "[OpenML](http://openml.org/).\n",
    "\n",
    "You can look at the OpenML webpage to learn more about this dataset:\n",
    "<http://www.openml.org/d/1590>\n",
    "\n",
    "The dataset is available as a CSV (Comma-Separated Values) file and we use\n",
    "`pandas` to read it.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\"><a class=\"reference external\" href=\"https://pandas.pydata.org/\">Pandas</a> is a Python library used for\n",
    "manipulating 1 and 2 dimensional structured data. If you have never used\n",
    "pandas, we recommend you look at this\n",
    "<a class=\"reference external\" href=\"https://pandas.pydata.org/docs/user_guide/10min.html\">tutorial</a>.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal with this data is to predict whether a person earns over 50K a year\n",
    "from heterogeneous data such as age, employment, education, family\n",
    "information, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The variables (columns) in the dataset\n",
    "\n",
    "The data are stored in a `pandas` dataframe. A dataframe is a type of\n",
    "structured data composed of 2 dimensions. This type of data is also referred\n",
    "as tabular data.\n",
    "\n",
    "Each row represents a \"sample\". In the field of machine learning or\n",
    "descriptive statistics, commonly used equivalent terms are \"record\",\n",
    "\"instance\", or \"observation\".\n",
    "\n",
    "Each column represents a type of information that has been collected and is\n",
    "called a \"feature\". In the field of machine learning and descriptive\n",
    "statistics, commonly used equivalent terms are \"variable\", \"attribute\", or\n",
    "\"covariate\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick way to inspect the dataframe is to show the first few lines with the\n",
    "`head` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adult_census.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative is to omit the `head` method. This would output the initial and\n",
    "final rows and columns, but everything in between is not shown by default. It\n",
    "also provides the dataframe's dimensions at the bottom in the format `n_rows`\n",
    "x `n_columns`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adult_census"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column named **class** is our target variable (i.e., the variable which we\n",
    "want to predict). The two possible classes are `<=50K` (low-revenue) and\n",
    "`>50K` (high-revenue). The resulting prediction problem is therefore a binary\n",
    "classification problem as `class` has only two possible values. We use the\n",
    "left-over columns (any column other than `class`) as input variables for our\n",
    "model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_column = \"class\"\n",
    "adult_census[target_column].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p>Here, classes are slightly imbalanced, meaning there are more samples of one\n",
    "or more classes compared to others. In this case, we have many more samples\n",
    "with <tt class=\"docutils literal\">\" &lt;=50K\"</tt> than with <tt class=\"docutils literal\">\" &gt;50K\"</tt>. Class imbalance happens often in practice\n",
    "and may need special techniques when building a predictive model.</p>\n",
    "<p class=\"last\">For example in a medical setting, if we are trying to predict whether subjects\n",
    "may develop a rare disease, there would be a lot more healthy subjects than\n",
    "ill subjects in the dataset.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset contains both numerical and categorical data. Numerical values\n",
    "take continuous values, for example `\"age\"`. Categorical values can have a\n",
    "finite number of values, for example `\"native-country\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_columns = [\n",
    "    \"age\",\n",
    "    \"education-num\",\n",
    "    \"capital-gain\",\n",
    "    \"capital-loss\",\n",
    "    \"hours-per-week\",\n",
    "]\n",
    "categorical_columns = [\n",
    "    \"workclass\",\n",
    "    \"education\",\n",
    "    \"marital-status\",\n",
    "    \"occupation\",\n",
    "    \"relationship\",\n",
    "    \"race\",\n",
    "    \"sex\",\n",
    "    \"native-country\",\n",
    "]\n",
    "all_columns = numerical_columns + categorical_columns + [target_column]\n",
    "\n",
    "adult_census = adult_census[all_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check the number of samples and the number of columns available in the\n",
    "dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    f\"The dataset contains {adult_census.shape[0]} samples and \"\n",
    "    f\"{adult_census.shape[1]} columns\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compute the number of features by counting the number of columns and\n",
    "subtract 1, since one of the columns is the target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The dataset contains {adult_census.shape[1] - 1} features.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visual inspection of the data\n",
    "Before building a predictive model, it is a good idea to look at the data:\n",
    "\n",
    "* maybe the task you are trying to achieve can be solved without machine\n",
    "  learning;\n",
    "* you need to check that the information you need for your task is actually\n",
    "  present in the dataset;\n",
    "* inspecting the data is a good way to find peculiarities. These can arise\n",
    "  during data collection (for example, malfunctioning sensor or missing\n",
    "  values), or from the way the data is processed afterwards (for example\n",
    "  capped values)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the distribution of individual features, to get some insights\n",
    "about the data. We can start by plotting histograms, note that this only works\n",
    "for features containing numerical values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = adult_census.hist(figsize=(20, 14))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p class=\"last\">In the previous cell, we used the following pattern: <tt class=\"docutils literal\">_ = func()</tt>. We do this\n",
    "to avoid showing the output of <tt class=\"docutils literal\">func()</tt> which in this case is not that\n",
    "useful. We actually assign the output of <tt class=\"docutils literal\">func()</tt> into the variable <tt class=\"docutils literal\">_</tt>\n",
    "(called underscore). By convention, in Python the underscore variable is used\n",
    "as a \"garbage\" variable to store results that we are not interested in.</p>\n",
    "</div>\n",
    "\n",
    "We can already make a few comments about some of the variables:\n",
    "\n",
    "* `\"age\"`: there are not that many points for `age > 70`. The dataset\n",
    "  description does indicate that retired people have been filtered out\n",
    "  (`hours-per-week > 0`);\n",
    "* `\"education-num\"`: peak at 10 and 13, hard to tell what it corresponds to\n",
    "  without looking much further. We'll do that later in this notebook;\n",
    "* `\"hours-per-week\"` peaks at 40, this was very likely the standard number of\n",
    "  working hours at the time of the data collection;\n",
    "* most values of `\"capital-gain\"` and `\"capital-loss\"` are close to zero."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For categorical variables, we can look at the distribution of values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adult_census[\"sex\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the data collection process resulted in an important imbalance\n",
    "between the number of male/female samples.\n",
    "\n",
    "Be aware that training a model with such data imbalance can cause\n",
    "disproportioned prediction errors for the under-represented groups. This is a\n",
    "typical cause of\n",
    "[fairness](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml#what-is-machine-learning-fairness)\n",
    "problems if used naively when deploying a machine learning based system in a\n",
    "real life setting.\n",
    "\n",
    "We recommend our readers to refer to [fairlearn.org](https://fairlearn.org)\n",
    "for resources on how to quantify and potentially mitigate fairness issues\n",
    "related to the deployment of automated decision making systems that rely on\n",
    "machine learning components.\n",
    "\n",
    "Studying why the data collection process of this dataset lead to such an\n",
    "unexpected gender imbalance is beyond the scope of this MOOC but we should\n",
    "keep in mind that this dataset is not representative of the US population\n",
    "before drawing any conclusions based on its statistics or the predictions of\n",
    "models trained on it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adult_census[\"education\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "As noted above, `\"education-num\"` distribution has two clear peaks around 10\n",
    "and 13. It would be reasonable to expect that `\"education-num\"` is the number\n",
    "of years of education.\n",
    "\n",
    "Let's look at the relationship between `\"education\"` and `\"education-num\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(\n",
    "    index=adult_census[\"education\"], columns=adult_census[\"education-num\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For every entry in `\\\"education\\\"`, there is only one single corresponding\n",
    "value in `\\\"education-num\\\"`. This shows that `\"education\"` and\n",
    "`\"education-num\"` give you the same information. For example,\n",
    "`\"education-num\"=2` is equivalent to `\"education\"=\"1st-4th\"`. In practice that\n",
    "means we can remove `\"education-num\"` without losing information. Note that\n",
    "having redundant (or highly correlated) columns can be a problem for machine\n",
    "learning algorithms."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">In the upcoming notebooks, we will only keep the <tt class=\"docutils literal\">\"education\"</tt> variable,\n",
    "excluding the <tt class=\"docutils literal\"><span class=\"pre\">\"education-num\"</span></tt> variable since the latter is redundant with\n",
    "the former.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to inspect the data is to do a `pairplot` and show how each\n",
    "variable differs according to our target, i.e. `\"class\"`. Plots along the\n",
    "diagonal show the distribution of individual variables for each `\"class\"`. The\n",
    "plots on the off-diagonal can reveal interesting interactions between\n",
    "variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "# We plot a subset of the data to keep the plot readable and make the plotting\n",
    "# faster\n",
    "n_samples_to_plot = 5000\n",
    "columns = [\"age\", \"education-num\", \"hours-per-week\"]\n",
    "_ = sns.pairplot(\n",
    "    data=adult_census[:n_samples_to_plot],\n",
    "    vars=columns,\n",
    "    hue=target_column,\n",
    "    plot_kws={\"alpha\": 0.2},\n",
    "    height=3,\n",
    "    diag_kind=\"hist\",\n",
    "    diag_kws={\"bins\": 30},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating decision rules by hand\n",
    "\n",
    "By looking at the previous plots, we could create some hand-written rules that\n",
    "predict whether someone has a high- or low-income. For instance, we could\n",
    "focus on the combination of the `\"hours-per-week\"` and `\"age\"` features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sns.scatterplot(\n",
    "    x=\"age\",\n",
    "    y=\"hours-per-week\",\n",
    "    data=adult_census[:n_samples_to_plot],\n",
    "    hue=target_column,\n",
    "    alpha=0.5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data points (circles) show the distribution of `\"hours-per-week\"` and\n",
    "`\"age\"` in the dataset. Blue points mean low-income and orange points mean\n",
    "high-income. This part of the plot is the same as the bottom-left plot in the\n",
    "pairplot above.\n",
    "\n",
    "In this plot, we can try to find regions that mainly contains a single class\n",
    "such that we can easily decide what class one should predict. We could come up\n",
    "with hand-written rules as shown in this plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "ax = sns.scatterplot(\n",
    "    x=\"age\",\n",
    "    y=\"hours-per-week\",\n",
    "    data=adult_census[:n_samples_to_plot],\n",
    "    hue=target_column,\n",
    "    alpha=0.5,\n",
    ")\n",
    "\n",
    "age_limit = 27\n",
    "plt.axvline(x=age_limit, ymin=0, ymax=1, color=\"black\", linestyle=\"--\")\n",
    "\n",
    "hours_per_week_limit = 40\n",
    "plt.axhline(\n",
    "    y=hours_per_week_limit, xmin=0.18, xmax=1, color=\"black\", linestyle=\"--\"\n",
    ")\n",
    "\n",
    "plt.annotate(\"<=50K\", (17, 25), rotation=90, fontsize=35)\n",
    "plt.annotate(\"<=50K\", (35, 20), fontsize=35)\n",
    "_ = plt.annotate(\"???\", (45, 60), fontsize=35)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In the region `age < 27` (left region) the prediction is low-income. Indeed,\n",
    "  there are many blue points and we cannot see any orange points.\n",
    "* In the region `age > 27 AND hours-per-week < 40` (bottom-right region), the\n",
    "  prediction is low-income. Indeed, there are many blue points and only a few\n",
    "  orange points.\n",
    "* In the region `age > 27 AND hours-per-week > 40` (top-right region), we see\n",
    "  a mix of blue points and orange points. It seems complicated to choose which\n",
    "  class we should predict in this region.\n",
    "\n",
    "It is interesting to note that some machine learning models work similarly to\n",
    "what we did: they are known as decision tree models. The two thresholds that\n",
    "we chose (27 years and 40 hours) are somewhat arbitrary, i.e. we chose them by\n",
    "only looking at the pairplot. In contrast, a decision tree chooses the \"best\"\n",
    "splits based on data without human intervention or inspection. Decision trees\n",
    "will be covered more in detail in a future module.\n",
    "\n",
    "Note that machine learning is often used when creating rules by hand is not\n",
    "straightforward. For example because we are in high dimension (many features\n",
    "in a table) or because there are no simple and obvious rules that separate the\n",
    "two classes as in the top-right region of the previous plot.\n",
    "\n",
    "To sum up, the important thing to remember is that in a machine-learning\n",
    "setting, a model automatically creates the \"rules\" from the existing data in\n",
    "order to make predictions on new unseen data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook Recap\n",
    "\n",
    "In this notebook we:\n",
    "\n",
    "* loaded the data from a CSV file using `pandas`;\n",
    "* looked at the different kind of variables to differentiate between\n",
    "  categorical and numerical variables;\n",
    "* inspected the data with `pandas` and `seaborn`. Data inspection can allow\n",
    "  you to decide whether using machine learning is appropriate for your data\n",
    "  and to highlight potential peculiarities in your data.\n",
    "\n",
    "We made important observations (which will be discussed later in more detail):\n",
    "\n",
    "* if your target variable is imbalanced (e.g., you have more samples from one\n",
    "  target category than another), you may need to be careful when interpreting\n",
    "  the values of performance metrics;\n",
    "* columns can be redundant (or highly correlated), which is not necessarily a\n",
    "  problem, but may require special treatment as we will cover in future\n",
    "  notebooks;\n",
    "* decision trees create prediction rules by comparing each feature to a\n",
    "  threshold value, resulting in decision boundaries that are always parallel\n",
    "  to the axes. In 2D, this means the boundaries are vertical or horizontal\n",
    "  line segments at the feature threshold values."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}